Geographical analysis of media flows

A multidimensional approach

Claude Grasland (Université de Paris (Diderot), FR 2007 CIST, UMR 8504 Géographie-cités)


Introduction

1 DATA COLLECTION

1.1 Importation of RSS

1.1.1 The Mediacloud database

(tbd: presentation of the Media Cloud project)

Media Cloud can be freely used by researchers. All you have to do is create an account at the following address:

https://explorer.mediacloud.org

There are different ways to get news titles. We will focus here on a simple example of data obtained through the Media Cloud interface. Suppose that you want to extract news about Europe from Tunisian newspapers.

1.1.2 Selection of media with source manager

We use the application called Source Manager and launch a search by collection, which is the most convenient way to explore what is available in a country. In our example, the target country is Tunisia, for which three collections are proposed:

We have selected the collection named “Tunisia National” because we are interested in the most important newspapers of the country.

The bubble chart on the right immediately indicates the media that have produced the highest number of news items, but it is wise to explore in more detail the list on the left, which indicates the starting date of data collection for each media outlet.

When a media outlet appears interesting, we click on its name to obtain a brief summary of the metadata. For example, in the case of L’économiste Maghrebin the metadata indicates:

The media looks promising, but before going further, it is better to have a look at the website of the media to get a more concrete idea of its content and ideological orientation, if we do not know them in advance.

Here we can see that this is an economic journal, published in French, with news organized in concentric geographic circles (Nation > Maghreb > Africa > World), which is precisely what we are looking for in the IMAGEUN project. We will complete this information later, but before doing so we have to check in more detail whether the production of the media is regular through time, using another tool offered by Media Cloud: the Explorer.

1.1.3 Checking the stability through time

We click on Search in Explorer on the metadata page of the Source Manager and obtain a new interface where we modify the dates to cover the full period of collection of the media (or our period of interest). In the search field, we keep the search term *, which requests all news.

Below your request, you obtain a graphic entitled Attention Over Time showing the number of news items published per day, which helps you verify whether the distribution of news is regular through time. You just have to change the type of graphic to visualize Story Count, and you can choose the time span (day, week or month) used to evaluate the regularity of the news flow. In our example, we notice some brief interruptions at daily level in 2019, but the flow is reasonably regular, with approximately 5 news items per day at the beginning and 10 to 20 in the final period. We also notice a classical weekly cycle, with fewer news items published during the weekend.

Going down, you will find a panel entitled Total Attention which gives you the total number of stories found. In our example, we have a total of 13626 stories produced by our media over the period.

1.1.5 Download and storage of news

Depending on your selection (all news or a specific topic), you will download more or fewer titles. Here, we make the choice to get all news, which means that we have to repeat the original request with *.

Finally, by clicking on the button Download all story URLs, you can get a .csv file that you can easily load into your favorite programming language, as we will see in the next section.
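Once the .csv file is downloaded, the same regularity check can also be reproduced locally in R. A minimal sketch, assuming the file has been saved under the path used later in this document and keeps the column names of the Media Cloud export (publish_date in particular):

```r
# Sketch: reproduce the "Attention Over Time" check on the downloaded .csv
# (path and column names follow the Media Cloud export used in this document)
df <- read.csv("data/fr_TUN_ecomag.csv", stringsAsFactors = FALSE)
df$date <- as.Date(df$publish_date)

# Count stories per week and plot the flow through time
weekly <- table(cut(df$date, breaks = "week"))
plot(as.Date(names(weekly)), as.numeric(weekly), type = "l",
     xlab = "week", ylab = "stories per week")
```

Brief interruptions of collection show up as dips toward zero in this plot, exactly as in the Media Cloud interface.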

1.2 Corpus creation

knitr::opts_chunk$set(cache = TRUE,
                        echo = TRUE,
                        comment = "")

In the previous section (ref…) we obtained a .csv file of news collected from Media Cloud. We will now put this data in a standard form, and we have chosen the format of the quanteda package as the reference for data organization and storage.

But of course the researchers involved in the project may prefer to use other R packages such as tm or tidytext, or even another programming language such as Python. This is why we explain how to transform and export the data prepared and harmonized with quanteda into various formats such as .csv or JSON.

We detail here an importation example based on the newspaper “L’économiste maghrebin”.

1.2.1 Importation of text to R

This step is not always straightforward because many encoding problems can appear that are more or less easy to solve. In principle, the data from Media Cloud are exported in standard UTF-8, but as we will see this is not necessarily the case.

We first try the standard R function read.csv():

store <- "data"
media <- "fr_TUN_ecomag"
type <- ".csv"

fic <- paste(store, "/", media, type, sep = "")

df <- read.csv(fic,
               sep = ",",
               header = TRUE,
               encoding = "UTF-8",
               stringsAsFactors = FALSE)
kable(head(df))
stories_id publish_date title url language ap_syndicated themes media_id media_name media_url
1129295780 2019-01-02 03:42:46 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 https://www.leconomistemaghrebin.com/2019/01/02/tarifs-adsl-reduits-1-janvier-2019/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295771 2019-01-02 04:06:27 6ème Sfax Marathon International des Oliviers https://www.leconomistemaghrebin.com/2019/01/02/sfax-marathon-international-oliviers/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295760 2019-01-02 06:05:08 Télécharger la version finale de la Loi de finances 2019 https://www.leconomistemaghrebin.com/2019/01/02/telecharger-la-version-finale-de-la-loi-de-finances-2019/ en False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129578051 2019-01-02 10:05:06 Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public https://www.leconomistemaghrebin.com/2019/01/02/chawki-tabib-245-dossiers-transferes-au-ministere-public/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461662 2019-01-02 07:52:36 Panoro Energy finalise l’acquisition de OMV Tunisia https://www.leconomistemaghrebin.com/2019/01/02/panoro-energy-finalise-lacquisition-de-omv-tunisia/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461636 2019-01-02 08:57:54 La partie syndicale maintient le boycott des examens du secondaire https://www.leconomistemaghrebin.com/2019/01/02/partie-syndicale-boycott-examens-secondaire/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/

The importation was successful for 12794 news items, but it failed for 3 news items, for which R returned the error:

Error in gregexpr(calltext, singleline, fixed = TRUE) : regular expression is invalid UTF-8
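One way to locate the offending records is to scan the raw lines of the file before parsing it. A minimal sketch, using the base R function validUTF8() (available since R 3.3.0) and the file path fic defined above:

```r
# Sketch: flag the lines of the raw .csv that are not valid UTF-8
# (fic is the file path built above; validUTF8() is base R >= 3.3.0)
raw_lines <- readLines(fic, warn = FALSE)
bad <- which(!validUTF8(raw_lines))
length(bad)           # number of problematic lines
head(raw_lines[bad])  # inspect them before deciding how to repair
```

The flagged lines can then be repaired by hand or converted with iconv() before re-importing.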

Looking in more detail, we also discover some encoding problems in the news items, as in the following example where the text of the news appears differently depending on whether we print it with the standard function paste() or the specialized function knitr::kable():

paste(df[9, 3])
[1] "Néji Jalloul : &#8220;Nidaa Tounes peut revenir si&#8230;&#8221;"
kable((df[9,3]))
x
Néji Jalloul : “Nidaa Tounes peut revenir si…”

1.2.2 Resolution of encoding problems

It is sometimes possible to fix the encoding problems manually, when they are not too numerous, as in the present example.

df$text <- df$title

# standardize apostrophes
df$text <- gsub("&#8217;", "'", df$text)

# standardize punctuation
df$text <- gsub('&#8230;', '.', df$text)

# standardize hyphens
df$text <- gsub('&#8211;', '-', df$text)

# remove quotation marks
df$text <- gsub('&#171;&#160;', '', df$text)
df$text <- gsub('&#160;&#187;', '', df$text)
df$text <- gsub('&#8220;', '', df$text)
df$text <- gsub('&#8221;', '', df$text)
df$text <- gsub('&#8216;', '', df$text)
df$text <- gsub('&#8243;', '', df$text)

We can introduce other cleaning procedures here, or keep them for a later stage of the analysis.
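As an alternative to one gsub() call per entity, all HTML entities can be decoded in a single pass. A sketch assuming the xml2 package is installed (note that, unlike the gsub() approach above, this converts quotation marks to real characters instead of removing them):

```r
# Alternative sketch: decode every HTML entity at once with xml2
# (assumed installed), instead of handling each entity separately
library(xml2)

decode_entities <- function(x) {
  vapply(x,
         function(s) xml_text(read_html(paste0("<p>", s, "</p>"))),
         character(1), USE.NAMES = FALSE)
}

decode_entities("N&#233;ji Jalloul : &#8220;Nidaa Tounes peut revenir si&#8230;&#8221;")
```

This is safer when the list of entities present in the corpus is not known in advance.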

1.2.3 Transformation in quanteda format

We propose a storage based on the quanteda format, obtained by simply transforming the data frame imported above. We keep only the name of the source and the date of publication.

# Create Quanteda corpus
qd <- corpus(df, docid_field = "stories_id")

# Select docvar fields and rename media
qd$date <- as.Date(qd$publish_date)
qd$source <- media
docvars(qd) <- docvars(qd)[, c("source", "date")]

# Add global meta
meta(qd, "meta_source") <- "Media Cloud "
meta(qd, "meta_time") <- "Download the 2021-09-30"
meta(qd, "meta_author") <- "Elaborated by Claude Grasland"
meta(qd, "project") <- "ANR-DFG Project IMAGEUN"

We have created a quanteda object with a lot of information stored in various fields. The structure of the object is the following:

str(qd)
 'corpus' Named chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" ...
   - attr(*, "names")= chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
   - attr(*, "docvars")='data.frame': 12794 obs. of  5 variables:
    ..$ docname_: chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
    ..$ docid_  : Factor w/ 12794 levels "1129295780","1129295771",..: 1 2 3 4 5 6 7 8 9 10 ...
    ..$ segid_  : int [1:12794] 1 1 1 1 1 1 1 1 1 1 ...
    ..$ source  : chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
    ..$ date    : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
   - attr(*, "meta")=List of 3
    ..$ system:List of 6
    .. ..$ package-version:Classes 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 3 0 0
    .. ..$ r-version      :Classes 'R_system_version', 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 4 1 0
    .. ..$ system         : Named chr [1:3] "Windows" "x86-64" "claude"
    .. .. ..- attr(*, "names")= chr [1:3] "sysname" "machine" "user"
    .. ..$ directory      : chr "C:/git/geomedia"
    .. ..$ created        : Date[1:1], format: "2021-11-25"
    .. ..$ source         : chr "data.frame"
    ..$ object:List of 2
    .. ..$ unit   : chr "documents"
    .. ..$ summary:List of 2
    .. .. ..$ hash: chr(0) 
    .. .. ..$ data: NULL
    ..$ user  :List of 4
    .. ..$ meta_source: chr "Media Cloud "
    .. ..$ meta_time  : chr "Download the 2021-09-30"
    .. ..$ meta_author: chr "Elaborated by Claude Grasland"
    .. ..$ project    : chr "ANR-DFG Project IMAGEUN"

We can look at the first titles with head():

kable(head(qd,3))
x
1129295780 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019
1129295771 6ème Sfax Marathon International des Oliviers
1129295760 Télécharger la version finale de la Loi de finances 2019

We can get meta information on each story with summary():

summary(head(qd,3))
Corpus consisting of 3 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We can get meta information about the full corpus with meta():

meta(qd)
$meta_source
  [1] "Media Cloud "

  $meta_time
  [1] "Download the 2021-09-30"

  $meta_author
  [1] "Elaborated by Claude Grasland"

  $project
  [1] "ANR-DFG Project IMAGEUN"

1.2.4 Storage of the quanteda object

We can finally save the object in .RDS format in a directory dedicated to our quanteda files. It can be useful to include some information in the name of the file.

store <- "data"
type <- ".RDS"
myfile <- paste(store, "/", media, type, sep = "")
myfile
[1] "data/fr_TUN_ecomag.RDS"
saveRDS(qd, myfile)
qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
  1129295780 :
  "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019"

  1129295771 :
  "6ème Sfax Marathon International des Oliviers"

  1129295760 :
  "Télécharger la version finale de la Loi de finances 2019"
summary(qd,3)
Corpus consisting of 12794 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We have kept all the information present in the initial file, and we have also added specific metadata of interest for us. The size of the storage is now 0.6 Mb, a division by 6 compared with the initial .csv file downloaded from Media Cloud, whose size was 3.8 Mb.
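The storage gain can be checked directly from R with the base function file.size(), using the paths defined above (the 3.8 Mb and 0.6 Mb figures are those observed for our example corpus):

```r
# Sketch: compare on-disk sizes of the raw .csv and the saved .RDS
# (paths are the ones used in this document)
file.size("data/fr_TUN_ecomag.csv") / 2^20   # initial .csv, about 3.8 Mb
file.size("data/fr_TUN_ecomag.RDS") / 2^20   # saved .RDS, about 0.6 Mb
```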

1.2.5 Back transformation to tibble

In the following steps, we will make intensive use of quanteda, but sometimes it can be useful to export the results in a more practical format or to use other packages. For this reason, it is important to know that the tidytext package can easily transform quanteda objects into tibbles, which are more classical, easier to manage, and easy to export to other formats such as data.frame or data.table.

td <- tidy(qd)
kable(head(td))
text source date
Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 fr_TUN_ecomag 2019-01-02
6ème Sfax Marathon International des Oliviers fr_TUN_ecomag 2019-01-02
Télécharger la version finale de la Loi de finances 2019 fr_TUN_ecomag 2019-01-02
Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public fr_TUN_ecomag 2019-01-02
Panoro Energy finalise l’acquisition de OMV Tunisia fr_TUN_ecomag 2019-01-02
La partie syndicale maintient le boycott des examens du secondaire fr_TUN_ecomag 2019-01-02
str(td)
tibble [12,794 x 3] (S3: tbl_df/tbl/data.frame)
   $ text  : chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" "6ème Sfax Marathon International des Oliviers" "Télécharger la version finale de la Loi de finances 2019" "Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public" ...
   $ source: chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
   $ date  : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
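From this tibble, the export to .csv or JSON announced at the beginning of this section is straightforward. A sketch assuming the jsonlite package is installed (the output file names are illustrative):

```r
# Sketch: export the tibble produced by tidy() to .csv and JSON
# (jsonlite assumed installed; output file names are illustrative)
library(jsonlite)

write.csv(td, "data/fr_TUN_ecomag_export.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
write_json(td, "data/fr_TUN_ecomag_export.json", auto_unbox = TRUE)
```

Both files can then be read directly in Python or any other environment.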

Bibliography

BARNIER, Julien, 2021. rmdformats: HTML Output Formats and Templates for ’rmarkdown’ Documents [online]. Available at: https://github.com/juba/rmdformats.
R CORE TEAM, 2020. R: A Language and Environment for Statistical Computing [online]. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
XIE, Yihui, 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R [online]. Available at: https://CRAN.R-project.org/package=knitr.

Annexes

Session info

setting value
version R version 4.1.0 (2021-05-18)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate French_France.1252
ctype French_France.1252
tz Europe/Paris
date 2021-11-25
package ondiskversion source
dplyr 1.0.6 CRAN (R 4.1.0)
ggplot2 3.3.3 CRAN (R 4.1.0)
knitr 1.34 CRAN (R 4.1.1)
quanteda 3.0.0 CRAN (R 4.1.0)
readtext 0.80 CRAN (R 4.1.0)
rmarkdown 2.11 CRAN (R 4.1.1)
rzine 0.1.0 gitlab ()
tidytext 0.3.1 CRAN (R 4.1.1)

Citation

@Manual{ficheRzine,
    title = {Titre de la fiche},
    author = {{Auteur.e.s}},
    organization = {Rzine},
    year = {202x},
    url = {http://rzine.fr/},
  }


Glossary